Abstract

We present a general framework for comparing multiple groups of
documents. A bipartite graph model is proposed where document groups
are represented as one node set and the comparison criteria are
represented as the other node set. Using this model, we present basic
algorithms to extract insights into similarities and differences among
the document groups. Finally, we demonstrate the versatility of our
framework through an analysis of NSF funding programs for basic
research.

1 Introduction and Motivation 

Given multiple sets (or groups) of documents, it is often necessary to
compare the groups to identify similarities and differences along
different dimensions. In this work, we present a general framework to
perform such comparisons for extraction of important insights. Indeed,
many real-world tasks can be framed as a problem of comparing two or
more groups of documents. Here, we provide two motivating examples.

1. Program Reviews. To better direct research efforts, funding
organizations such as the National Science Foundation (NSF), the
National Institutes of Health (NIH), and the Department of Defense
(DoD) are often in the position of reviewing research programs via
their artifacts (e.g., grant abstracts, published papers, and other
research descriptions). Such reviews might involve identifying overlaps
across different programs, which may indicate a duplication of effort.
They may also involve the identification of unique, emerging, or
diminishing topics. A "document group" here could be defined as a
particular research program that funds many organizations, the totality
of funded research conducted by a specific organization, or all
research associated with a particular time period (e.g., fiscal year).
In all cases, the objective is to draw comparisons between groups by
comparing the document sets associated with them.

2. Intelligence. In the areas of defense and intelligence, document
sets are sometimes obtained from different sources or entities. For
instance, the U.S. Armed Forces sometimes seize documents during raids
of terrorist strongholds.¹ Similarities between two document sets (each
captured from a different source) can potentially be used to infer a
non-obvious association between the sources.


Of course, there are numerous additional examples across many domains
(e.g., comparing different news sources, comparing the reviews for
several products, etc.). Given the abundance of real-world applications
as illustrated above, it is surprising, then, that there are no
existing general-purpose approaches for drawing such comparisons. While
there is some previous work on the comparison of document sets
(referred to as comparative text mining), these existing approaches
lack the generality to be widely applicable across different use case
scenarios with different comparison criteria. Moreover, much of the
work in the area focuses largely on the summarization of shared or
unshared topics among document groups (e.g., Wan et al. (2011), Huang
et al. (2011), Campr and Jezek (2013), Wang et al. (2012), Zhai et al.
(2004)). That is, the problem of drawing multi-faceted comparisons
among the groups themselves is not typically addressed. This, then,
motivates our development of a general-purpose model for comparisons of
document sets along arbitrary dimensions. We use this model for the
identification of similarities, differences, trends, and anomalies
among large groups of documents. We begin by formally describing our
model.


2 Our Formal Model for Comparing Document Groups

As input, we are given several groups of documents, and our task is to
compare them. We now formally define these document groups and the
criteria used to compare them. Let $D = \{d_1, d_2, \ldots, d_N\}$ be a
document collection comprising the totality of documents under
consideration, where $N$ is the number of documents. Let $D^P$ be a
partition of $D$ representing the document groups.

¹ See Document Exploitation (DOCEX) at http://en.wikipedia.org for
more information.















Definition 1 A document group is a subset $D^P_i \in D^P$
(where index $i \in \{1 \ldots |D^P|\}$).

Each document group in $D^P$, for instance, might represent articles
associated with either a particular organization (e.g., university), a
research funding source (e.g., NSF or DARPA program), or a time period
(e.g., a fiscal year). Document groups are compared using comparison
criteria, $D^C$, a family of subsets of $D$.

Definition 2 A comparison criterion is a subset $D^C_i \in D^C$
(where index $i \in \{1 \ldots |D^C|\}$).

Intuitively, each subset of $D^C$ represents a set of documents sharing
some attribute. Our model allows great flexibility in how $D^C$ is
defined. For instance, $D^C$ might be defined by the named entities
mentioned within documents (e.g., each subset contains documents that
mention a particular person or organization of interest). For the
present work, we define $D^C$ by topics discovered using latent
Dirichlet allocation, or LDA (Blei et al., 2003).

LDA Topics as Comparison Criteria. Probabilistic topic modeling
algorithms like LDA discover latent themes (i.e., topics) in document
collections. By using these discovered topics as the comparison
criteria, we can compare arbitrary groups of documents by the themes
and subject areas comprising them. Let $K$ be the number of topics or
themes in $D$. Each document in $D$ is composed of a sequence of words:
$d_i = \langle s_{i1}, s_{i2}, \ldots, s_{iN_i} \rangle$, where $N_i$
is the number of words in $d_i$ and $i \in \{1 \ldots N\}$.
$V = \bigcup_{i=1}^{N} f(d_i)$ is the vocabulary of $D$, where
$f(\cdot)$ takes a sequence of elements and returns a set. LDA takes
$K$ and $D$ (including its components such as $V$) as input and
produces two matrices as output, one of which is $\theta$. The matrix
$\theta \in \mathbb{R}^{N \times K}$ is the document-topic distribution
matrix and shows the distribution of topics within each document. Each
row of the matrix represents a probability distribution. $D^C$ is
constructed using $K$ subsets of documents, each of which represents a
set of documents pertaining largely to the same topic. That is, for
$t \in \{1 \ldots K\}$ and $i \in \{1 \ldots N\}$, each subset
$D^C_t \in D^C$ comprises all documents $d_i$ where
$t = \arg\max_{j} \theta_{ij}$.² Having defined the document groups
$D^P$ and the comparison criteria $D^C$, we now construct a bipartite
graph model used to perform comparisons.
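
As a concrete illustration of the assignment rule just described, the
following minimal sketch groups documents by their highest-probability
topic. It is not taken from the paper's implementation: the function
name `criteria_from_theta`, the toy matrix, and the tie-breaking rule
(lowest topic index wins) are assumptions made for illustration.

```python
from collections import defaultdict


def criteria_from_theta(theta):
    """Group document indices by their highest-probability topic.

    `theta` is a list of rows, one per document, each row holding that
    document's topic proportions (hypothetical toy input).
    """
    criteria = defaultdict(set)
    for doc_index, topic_probs in enumerate(theta):
        # argmax over topics; on ties, max() keeps the lowest topic index
        best_topic = max(range(len(topic_probs)), key=topic_probs.__getitem__)
        criteria[best_topic].add(doc_index)
    return dict(criteria)


# Toy document-topic matrix: 4 documents, 3 topics; each row sums to 1.
theta = [
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.6, 0.3, 0.1],
    [0.2, 0.2, 0.6],
]
criteria = criteria_from_theta(theta)
```

Because every document receives exactly one topic, the resulting
subsets do not overlap and together cover the whole collection.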

² $D^C$ is also a partition of $D$, when defined in this way.

A Bipartite Graph Model. Our objective is to compare the document
groups in $D^P$ based on $D^C$. We do so by representing $D^P$ and
$D^C$ as a weighted bipartite graph, $G = (P, C, E, w)$, where $P$ and
$C$ are disjoint sets of nodes, $E$ is the edge set, and
$w : E \to \mathbb{Z}^{+}$ are the edge weights. Each subset of $D^P$
is represented as a node in $P$, and each subset of $D^C$ is
represented as a node in $C$. Let $\alpha : P \to D^P$ and
$\beta : C \to D^C$ be functions that map nodes to the document subsets
that they represent. Then, the edge set $E$ is
$\{(u,v) \mid u \in P, v \in C, \alpha(u) \cap \beta(v) \neq \emptyset\}$,
and the edge weight for any two nodes $u \in P$ and $v \in C$ is
$w((u,v)) = |\alpha(u) \cap \beta(v)|$. Concisely, each weighted edge
in $G$ between a document group (in $P$) and a topic (in $C$)
represents the number of documents shared between the two sets.
Figure 1 shows a toy illustration of the model. Each node in $P$ is
shown in black and represents a subset of $D^P$ (i.e., a document
group). Each node in $C$ is shown in gray and represents a subset of
$D^C$ (i.e., a document cluster pertaining primarily to the same
topic). Each edge represents the intersection of the two subsets it
connects. In the next section, we will describe basic algorithms on
such bipartite graphs capable of yielding important insights into the
similarities and differences among document groups.
Figure 1: [Toy Illustration of Bipartite Graph Model.] Each black node
(i.e., node $\in P$) represents a document group. Each gray node (i.e.,
node $\in C$) represents a cluster of documents pertaining primarily to
the same topic.
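
The construction of the weighted edge set can be sketched in a few
lines. This is an illustrative sketch only, assuming document groups
and comparison criteria are available as dictionaries of document-id
sets; the names `build_bipartite_edges`, `program_A`, `topic_0`, and so
on are hypothetical.

```python
def build_bipartite_edges(groups, criteria):
    """Return {(group_name, criterion_name): weight}.

    An edge exists only when a group and a criterion share at least one
    document; its weight is the number of shared documents.
    """
    edges = {}
    for group_name, group_docs in groups.items():
        for crit_name, crit_docs in criteria.items():
            shared = len(group_docs & crit_docs)
            if shared > 0:
                edges[(group_name, crit_name)] = shared
    return edges


# Toy data: five documents (0..4), two groups, three topics.
groups = {"program_A": {0, 1, 2}, "program_B": {3, 4}}
criteria = {"topic_0": {0, 2, 3}, "topic_1": {1}, "topic_2": {4}}
edges = build_bipartite_edges(groups, criteria)
```

When both the groups and the criteria partition the collection, as in
this toy input, the edge weights sum to the number of documents.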


3 Basic Algorithms Using the Model 

We focus on three basic operations in this work. 


Node Entropy. Let $w$ be a vector of weights for all edges incident to
some node $v \in P \cup C$. The entropy $H$ of $v$ is:
$H(v) = -\sum_{i} p_i \log_{|w|}(p_i)$, where
$p_i = w_i / \sum_{j} w_j$ and $i,j \in \{1 \ldots |w|\}$. A similar
formulation was employed in Eagle et al. (2010). Intuitively, if
$v \in P$, $H(v)$ measures the extent to which the document group is
concentrated around a small number of topics (lower values of $H(v)$
mean more concentrated). Similarly, if $v \in C$, it is the extent to
which a topic is concentrated around a small number of document groups.
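
A minimal sketch of the entropy computation follows. It assumes the
logarithm is taken base $|w|$ (the number of incident edges), so that
values fall between 0 and 1, and it assumes a node with a single edge
is assigned entropy 0; both choices, and the name `node_entropy`, are
assumptions of this sketch rather than details confirmed by the text.

```python
import math


def node_entropy(weights):
    """Entropy of a node's incident edge weights, log base len(weights)."""
    n = len(weights)
    if n < 2:
        # Assumed convention: one edge means fully concentrated.
        return 0.0
    total = sum(weights)
    entropy = 0.0
    for w in weights:
        p = w / total
        if p > 0:
            entropy -= p * math.log(p, n)
    return entropy


focused = node_entropy([98, 1, 1])    # nearly all weight on one topic
diffuse = node_entropy([34, 33, 33])  # weight spread across topics
```

Under these assumptions, `focused` is close to 0 and `diffuse` is close
to 1, matching the reading that lower entropy means more concentrated.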


Node Similarity. Given a graph $G$, there are many ways to measure the
similarity of two nodes based on their connections. Such measures can
be used to infer similarity (and dissimilarity) among document groups.
However, existing methods are not well-suited for the task of document
group comparison. The well-known SimRank algorithm (Jeh and Widom,
2002) ignores edge weights, and neither SimRank nor its extension,
SimRank++ (Antonellis et al., 2008), scales to larger graphs.
SimRank++ and ASCOS (Chen and Giles, 2013) do incorporate edge weights
but in ways that are not appropriate for document group comparisons.
For instance, both SimRank++ and ASCOS incorporate magnitude in the
similarity computation. Consider the case where document groups are
defined as research labs. ASCOS and SimRank++ will measure large
research labs and small research labs as less similar when in fact they
may publish nearly identical lines of research. Finally, under these
existing methods, document groups sharing zero topics in common could
still be considered similar, which is undesirable here. For these
reasons, we formulate similarity as follows. Let $N_G(\cdot)$ be a
function that returns the neighbors of a given node in $G$. Given two
nodes $u,v \in P$, let $L^{u,v} = N_G(u) \cup N_G(v)$ and let
$x : I \to L^{u,v}$ be the indexing function for $L^{u,v}$.³ We
construct two vectors, $a$ and $b$, where $a_k = w(u, x(k))$,
$b_k = w(v, x(k))$, and $k \in I$. Each vector is essentially a
sequence of weights for edges between $u,v \in P$ and each node in
$L^{u,v}$. Similarity of two nodes is measured using the cosine
similarity of their corresponding sequences, which we compute using a
function $sim(\cdot,\cdot)$. Thus, document groups are considered more
similar when they have similar sets of topics in similar proportions.
As we will show later, this simple solution, based on item-based
collaborative filtering (Sarwar et al., 2001), is surprisingly
effective at inferring similarity among document groups in $G$.
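
The similarity computation can be sketched as follows. This is a
minimal illustration, assuming the bipartite graph is stored as a
dictionary from (group, topic) pairs to weights and that a missing edge
contributes a weight of 0; the function name `sim` mirrors the text,
while the lab and topic names are hypothetical.

```python
import math


def sim(edges, u, v):
    """Cosine similarity of two group nodes over the union of their neighbors."""
    nbrs_u = {c: w for (p, c), w in edges.items() if p == u}
    nbrs_v = {c: w for (p, c), w in edges.items() if p == v}
    union = sorted(set(nbrs_u) | set(nbrs_v))
    # Weight vectors over the shared index; absent edges count as 0.
    a = [nbrs_u.get(c, 0) for c in union]
    b = [nbrs_v.get(c, 0) for c in union]
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


edges = {
    ("small_lab", "t0"): 2, ("small_lab", "t1"): 1,
    ("large_lab", "t0"): 200, ("large_lab", "t1"): 100,
    ("other_lab", "t2"): 5,
}
same_mix = sim(edges, "small_lab", "large_lab")
no_overlap = sim(edges, "small_lab", "other_lab")
```

In this toy input the small and large labs have the same topic
proportions, so their similarity is 1 regardless of size, while groups
with no topics in common receive a similarity of 0.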

³ $I$ is the index set of $L^{u,v}$.

Node Clusters. Identifying clusters of related nodes in the bipartite
graph $G$ can show how document groups form larger classes. However, we
find that $G$ is typically fairly dense. For this reason, partitioning
of the one-mode projection of $G$ and other standard bipartite graph
clustering techniques (e.g., Dhillon (2001) and Sun et al. (2009)) are
rendered less effective. We instead employ a different tack and exploit
the node similarities computed earlier. We transform $G$ into a new
weighted graph $G^P = (P, E^P, w^{sim})$ where
$E^P = \{(u,v) \mid u,v \in P, sim(u,v) > \xi\}$, $\xi$ is a predefined
threshold, and $w^{sim}$ is the edge weight function (i.e.,
$w^{sim} = sim$). Thus, $G^P$ is the similarity graph of document
groups. $\xi = 0.5$ was used as the threshold for our analyses. To find
clusters in $G^P$, we employ the Louvain algorithm, a heuristic method
based on modularity optimization (Blondel et al., 2008). Modularity
measures the fraction of edges falling within clusters as compared to
the expected fraction if edges were distributed at random in the graph
(Newman, 2006). The algorithm initially assigns each node to its own
cluster. At each iteration, in a local and greedy fashion, nodes are
re-assigned to clusters with which they achieve the highest modularity.
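
The sketch below illustrates the two ingredients of this step: the
thresholded similarity graph and the modularity score that the Louvain
algorithm greedily improves. It deliberately does not implement Louvain
itself; it only scores a given candidate clustering. The function names
and the toy similarity table are assumptions for illustration.

```python
def similarity_graph(nodes, sim_fn, threshold=0.5):
    """Keep an edge between two group nodes when similarity exceeds the threshold."""
    graph = {}
    for i, u in enumerate(nodes):
        for v in nodes[i + 1:]:
            s = sim_fn(u, v)
            if s > threshold:
                graph[(u, v)] = s
    return graph


def modularity(graph, clusters):
    """Weighted modularity: for each cluster, the share of edge weight inside
    it minus the share expected from its total degree."""
    m = sum(graph.values())
    if m == 0:
        return 0.0
    label = {node: idx for idx, members in enumerate(clusters) for node in members}
    internal = [0.0] * len(clusters)
    degree = [0.0] * len(clusters)
    for (u, v), w in graph.items():
        degree[label[u]] += w
        degree[label[v]] += w
        if label[u] == label[v]:
            internal[label[u]] += w
    return sum(internal[c] / m - (degree[c] / (2 * m)) ** 2
               for c in range(len(clusters)))


# Hypothetical pairwise similarities among four document groups.
table = {("a", "b"): 0.9, ("c", "d"): 0.8, ("a", "c"): 0.2,
         ("a", "d"): 0.1, ("b", "c"): 0.3, ("b", "d"): 0.1}
graph = similarity_graph(["a", "b", "c", "d"], lambda u, v: table[(u, v)])
split_score = modularity(graph, [{"a", "b"}, {"c", "d"}])
merged_score = modularity(graph, [{"a", "b", "c", "d"}])
```

With the threshold of 0.5, only the two strong pairs survive as edges,
and the clustering that separates them scores higher than placing all
four groups in one cluster.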

4 Example Analysis: NSF Grants 

As a realistic and informative case study, we utilize our model to
characterize funding programs of the National Science Foundation (NSF).
This corpus consists of 132,372 grant abstracts describing awards for
basic research and other support funded by the NSF between the years
1990 and 2002 (Bache and Lichman, 2013).⁴ Each award is associated
with both a program element (i.e., funding source) and a date. We
define document groups in two ways: by program element and by calendar
year. For comparison criteria, we used topics discovered with the
MALLET implementation of LDA (McCallum, 2002) using $K = 400$ as the
number of topics and 200 as the number of iterations. All other
parameters were left as defaults. The NSF corpus possesses unique
properties that lend themselves to experimental evaluation. For
instance, program elements are not only associated with specific sets
of research topics but are named based on the content of the program.
This provides a measure of ground truth against which we can validate
our model. We structure our analyses around specific questions, which
now follow.

Which NSF programs are focused on specific areas and which are not?
When defining document groups as program elements (i.e., each NSF
program is a node in $P$), node entropy can be used to answer this
question. Table 1 shows examples of program elements most and least
associated with specific topics, as measured by entropy. For example,
the program 1311 Linguistics (low entropy) is largely focused on a
single linguistics topic (labeled by LDA with words such as "language,"
"languages," and "linguistic"). By contrast, the Australia program
(high entropy) was designed to support US-Australia cooperative
research across many fields, as correctly inferred by our model.


Low Entropy Program Elements
Program                                     Primary LDA Topic
1311 Linguistics                            language languages linguistic
4091 Network Infrastructure                 network connection internet

High Entropy Program Elements
Program                                     Primary LDA Topic
5912 Australia                              (many topics & disciplines)
9130 Res. Improvements in Minority Instit.  (many topics & disciplines)

Table 1: [Examples of High/Low Entropy Programs.]


⁴ Data for years 1989 and 2003 in this publicly available corpus were
partially missing and omitted in some analyses.

Which research areas are growing/emerging? When defining document
groups as calendar years (instead of program elements), low-entropy
nodes in $C$ are topics concentrated around certain years.
Concentrations in later years indicate growth. The LDA-discovered topic
nanotechnology is among the lowest entropy topics (i.e., an outlier
topic with respect to entropy). As shown in Figure 2, the number of
nanotechnology grants drastically increased in proportion through 2002.
This result is consistent with history, as the National Nanotechnology
Initiative was proposed in the late 1990s to promote nanotechnology
R&D.⁵ One could also measure such trends using budget allocations by
incorporating the award amounts into the edge weights of $G$.



Figure 2: [Uptrend in Nanotechnology.] Our model correctly identifies
the surge in nanotechnology R&D beginning in the late 1990s.

⁵ See National Nanotechnology Initiative at http://en.wikipedia.org
for more information.

Given an NSF program, to which other programs is it most similar? As
described in Section 3, when each node in $P$ represents an NSF
program, our model can easily identify the programs most similar to a
given program. For instance, Table 2 shows the top three most similar
programs to both the Theoretical Physics and Ecology programs. Results
agree with intuition. For each NSF program, we identified the top $n$
most similar programs ranked by our $sim(\cdot,\cdot)$ function, where
$n \in \{3, 6, 9\}$. These programs were manually judged for
relatedness, and the Mean Average Precision (MAP), a standard
performance metric for ranking tasks in information retrieval, was
computed. We were unsuccessful in evaluating alternative weighted
similarity measures mentioned in Section 3 due to their aforementioned
issues with scalability and the size of the NSF dataset. (For instance,
the implementations of ASCOS (Chen and Giles, 2013) and SimRank (Jeh
and Widom, 2002) that we considered are available here.⁶) Recall that
our $sim(\cdot,\cdot)$ function is based on measuring the cosine
similarity between two weight vectors, $a$ and $b$, generated from our
bipartite graph model. As a baseline for comparison, we evaluated two
additional similarity implementations using these weight vectors. The
first measures the similarity between weight vectors using weighted
Jaccard similarity, which is
$\frac{\sum_k \min(a_k, b_k)}{\sum_k \max(a_k, b_k)}$ (denoted as Wtd.
Jaccard). The second measure is implemented by taking the Spearman's
rank correlation coefficient of $a$ and $b$ (denoted as Rank). Figure 3
shows the Mean Average Precision (MAP) for each method and each value
of $n$. With the exception of the difference between Cosine and Wtd.
Jaccard for MAP@3, all other performance differentials were
statistically significant, based on a one-way ANOVA and post-hoc Tukey
HSD at a 5% significance level. This, then, provides some validation
for our choice.

⁶ See networkx.addon project at http://github.com/hhchen1105/
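
For reference, the weighted Jaccard baseline can be sketched as below:
the sum of element-wise minima divided by the sum of element-wise
maxima of the two weight vectors. This is an illustrative sketch, not
the implementation used in the evaluation; the function name and the
toy vectors are assumptions.

```python
def weighted_jaccard(a, b):
    """Sum of element-wise minima over sum of element-wise maxima."""
    numerator = sum(min(x, y) for x, y in zip(a, b))
    denominator = sum(max(x, y) for x, y in zip(a, b))
    return numerator / denominator if denominator else 0.0


# Two groups with identical topic proportions but very different sizes.
small = [2, 1, 0]
large = [200, 100, 0]
score = weighted_jaccard(small, large)
```

On these toy vectors the score is 3/300, which shows that weighted
Jaccard, unlike cosine similarity, is sensitive to the difference in
magnitude between the two groups.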


1245 Theoretical Physics            1182 Ecology
1286 Elementary Particle Theory     1128 Ecological Studies
1287 Mathematical Physics           1196 Environmental Biology
1284 Atomic Theory                  1195 Ecological Research

Table 2: [Similarity Queries.] Three most similar programs to the
Theoretical Physics and Ecology programs.



[Plot x-axis: MAP@3, MAP@6, MAP@9]

Figure 3: [Mean Average Precision (MAP).] Cosine similarity outperforms
alternative approaches.

How do NSF programs join together to form larger program categories?
As mentioned, by using the similarity graph $G^P$ constructed from $G$,
clusters of related NSF programs can be discovered. Figure 4, for
instance, shows a discovered cluster of NSF programs all related to the
field of neuroscience. Each NSF program (i.e., node) is composed of
many documents.



Figure 4: [Neuroscience Programs.] A discovered cluster of program
elements all related to neuroscience.

Which pairs of grants are the most similar in the research they
describe? Although the focus of this paper is on drawing comparisons
among groups of documents, it is often necessary to draw comparisons
among individual documents, as well. For instance, in the case of this
NSF corpus, one may wish to identify pairs of grants from different
programs describing highly similar lines of research. One common
approach to this is to exploit the low-dimensional representations of
documents returned by LDA (Blei et al., 2003). Any given document
$d_i \in D$ (where $i \in \{1 \ldots N\}$) can be represented by a
$K$-dimensional probability vector of topic proportions given by
$\theta_i$, the $i$-th row of the document-topic matrix $\theta$. The
similarity between any two documents, then, can be measured using the
distance between their corresponding probability vectors (i.e.,
probability distributions). We quantify the similarity between
probability vectors using the complement of Hellinger distance:
$H_S(d_x, d_y) = 1 - \frac{1}{\sqrt{2}} \sqrt{\sum_{i=1}^{K} \left(\sqrt{\theta_{xi}} - \sqrt{\theta_{yi}}\right)^2}$,
where $x,y \in \{1 \ldots N\}$. Unfortunately, identifying the set of
most similar document pairs in this way can be computationally
expensive, as the number of pairwise comparisons scales quadratically
with the size of the corpus. For the moderately-sized NSF corpus, this
amounts to well over 8 billion comparisons. To address this issue, our
bipartite graph model can be exploited as a blocking heuristic using
either the document groups or the comparison criteria. In the latter
case, one can limit the pairwise comparisons to only those documents
that reside in the same subset of $D^C$. For the former case, node
similarity can be used. Instead of comparing each document with every
other document, we can limit the comparisons to only those document
groups of interest that are deemed similar by our model. As an
illustrative example, out of the 665 different NSF programs covering
these 132,372 grant abstracts, the program 1271 Computational
Mathematics and the program 2865 Numeric, Symbolic, and Geometric
Computation are inferred as being highly similar by our model. Thus, we
can limit the pairwise comparisons to only such document groups that
are similar and likely to contain similar documents. In the case of
these two programs, the following two grants are easily identified as
being the most similar with a Hellinger similarity ($H_S$) score of
0.73 (only text snippets are shown due to space constraints):

Grant #1
Program: 1271 Computational Mathematics
Title: Analyses of Structured Computational Problems and Parallel
Iterative Algorithms.
Abstract: The main objectives of the research planned is the analysis
of large scale structured computational problems and of the convergence
of parallel iterative methods for solving linear systems and
applications of these techniques to the solution of large sparse and
dense structured systems of linear equations

Grant #2
Program: 2865 Numeric, Symbolic, and Geometric Computation
Title: Sparse Matrix Algorithms on Distributed Memory Multiprocessors.
Abstract: The design, analysis, and implementation of algorithms for
the solution of sparse matrix problems on distributed memory
multiprocessors will be investigated. The development of these parallel
sparse matrix algorithms should have an impact of challenging
large-scale computational problems in several scientific, econometric,
and engineering disciplines.

Some key terms in each grant are manually highlighted in bold. As can
be seen, despite some differences in terminology, the two lines of
research are related, as matrices (studied in Grant #2) are used to
compactly represent and work with systems of linear equations (studied
in Grant #1). That is, despite such differences in terminology (e.g.,
"matrix" vs. "linear systems", "parallel" vs. "distributed"), document
similarity can still be accurately inferred by taking the Hellinger
similarity of the LDA-derived low-dimensional representations for the
two documents. In this way, by exploiting the group-level similarities
inferred by our model in combination with such document-level
similarities, we can more effectively "zero in" on such highly related
document pairs.
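
The document-level comparison and the blocking idea discussed in this
section can be sketched together. The code below is a minimal
illustration under stated assumptions: topic vectors are plain lists of
proportions, blocks are sets of document ids sharing a topic, and the
names `hellinger_similarity` and `blocked_pairs` are hypothetical. The
pair count uses the corpus size reported above.

```python
import math
from itertools import combinations


def hellinger_similarity(p, q):
    """One minus the Hellinger distance between two probability vectors."""
    squared = sum((math.sqrt(x) - math.sqrt(y)) ** 2 for x, y in zip(p, q))
    return 1.0 - math.sqrt(squared) / math.sqrt(2)


def blocked_pairs(blocks):
    """Yield document pairs only from within the same block."""
    for members in blocks.values():
        yield from combinations(sorted(members), 2)


# Exhaustive comparison of all 132,372 abstracts: n * (n - 1) / 2 pairs.
n_docs = 132372
all_pairs = n_docs * (n_docs - 1) // 2

# Toy blocking by topic: only documents sharing a topic are compared.
blocks = {"topic_0": {0, 2, 3}, "topic_1": {1}}
candidate_pairs = list(blocked_pairs(blocks))

identical = hellinger_similarity([0.5, 0.5], [0.5, 0.5])
disjoint = hellinger_similarity([1.0, 0.0], [0.0, 1.0])
```

The exhaustive count comes to 8,761,107,006 pairs, whereas the toy
blocking compares only the three pairs inside `topic_0`. Identical
distributions score 1 and distributions with no shared mass score 0.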

5 Conclusion 

We have presented a bipartite graph model for drawing comparisons among
large groups of documents. We showed how basic algorithms using the
model can identify trends and anomalies among the document groups. As
an example analysis, we demonstrated how our model can be used to
better characterize and evaluate NSF research programs. For future
work, we plan on employing alternative comparison criteria in our model
such as those derived from named entity recognition and paraphrase
detection.